Automatic Extraction of Analytical Chemical Information. System Description, Inventory of Tasks and Problems, and Preliminary Results
Authors: Postma et al.
Abstract
Figure 1. Overview of the different modules of the system.

—domain independency
—semiautomatic: the user is consulted if the system encounters problems.

The last starting point is based on the aforementioned publications concerning automatic information extraction, which reveal that information extraction based on Natural Language techniques is still very difficult, with varying percentages of accurate results. In such a situation a semiautomatic system is expected to perform better than a fully automatic one. The other starting points will be motivated in the next sections. The system was initially to be developed for a test set of 124 abstracts.

THEORY AND IMPLEMENTATION

Text analysis normally consists of lexical, syntactic, semantic, and discourse analysis, as is also depicted in Figure 1. The various tasks can be integrated in single modules that execute them concurrently or intertwined. An advantage of integration is that the semantics can be used as soon as possible in order to limit the number of possible solutions generated by the syntactic analysis. A disadvantage is that maintainability decreases strongly as the system grows. This was experienced during the development of a previous system8 as well. This, together with the aforementioned requirement of a robust syntax and semantics, motivated the choice for separated modules. Other advantages are the following: the modules can be developed independently (by different people); a module can be exchanged for one based on other principles, if this is required; and a modular structure gives better insight into the specific types of knowledge that are necessary for the different modules (and submodules). Another requirement of the system was that it should be as independent of the domain as possible. This was implemented by locating the domain-dependent procedures and information in separate modules and files (mainly the lexica).
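The separated-module design described above can be sketched as a chain of independent, exchangeable functions. The following is a minimal illustration (not the authors' code, which was written in SPITBOL): the domain-dependent lexicon is kept outside the modules, and an optional `ask_user` callback stands in for the semiautomatic user consultation; all names are assumptions.

```python
# Minimal sketch of the separated-module pipeline: each analysis
# stage is an independent function, chained in a fixed order, and
# the domain-dependent data (the lexicon) is passed in from outside.
LEXICON = {"determination": {"cat": "noun"},
           "of": {"cat": "prep"},
           "copper": {"cat": "noun", "sem": "element"}}

def lexical(tokens, lexicon=LEXICON, ask_user=None):
    """Look up each token; consult the user for unknown words
    (the 'semiautomatic' behaviour described in the text)."""
    out = []
    for t in tokens:
        entry = lexicon.get(t)
        if entry is None and ask_user is not None:
            entry = ask_user(t)          # e.g. an interactive lexicon editor
        out.append((t, entry or {"cat": "unknown"}))
    return out

def syntactic(tagged):
    """Placeholder for the syntactic module: attach categories."""
    return [{"word": w, **info} for w, info in tagged]

def run(sentence, modules=(lexical, syntactic)):
    data = sentence.lower().split()
    for m in modules:                    # modules are exchangeable
        data = m(data)
    return data

result = run("Determination of copper")
```

Because each stage only consumes the previous stage's output, a module can be replaced by one based on other principles without touching the rest of the chain, which is the maintainability argument made in the text.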
In this way, results of other investigations can be implemented more easily; in case of problems, their origin (domain specific or general linguistic) can be determined more easily and research can be directed to it; and the system can be used for other domains as well. Also, it facilitates a better understanding of all phenomena that play a role.

The lexical module18 consists of two lexica, a morphological analyzer for words, and a lexical postprocessor mainly recognizing word groups. The first lexicon is filled with (domain dependent) single words. Its structure is domain independent: it contains entries for (the stem forms of) words and abbreviations, together with their syntactic categories (nouns, verbs, etc.), semantic categories, and, if necessary, a reference to an enumeration of the roles, the role-identifying prepositions (if applicable), and the expected semantic classes of participants that are linked by the given roles to the words (semantic selection restriction frames). The semantic selection restriction frames are stored in a separate file and are indexed by numbers (in order to save space: a frame can apply to more than one verb). The second lexicon contains concepts that consist of more than one word. The morphological analyzer contains the normal domain-independent morphological rules (functions) that deal with declensions of words, the recognition of adverbs that are derived from adjectives, and the handling of plurals. Besides this, it contains functions for the recognition of numerals, and it contains separate domain-dependent functions for the recognition of inorganic structural formulas (e.g., Na2CO3) so that these need not be given in the lexicon (besides those that need extra semantic subcategorization). The abstracts contain a number of complex strings that need special attention. These are in fact combinations of words which are not separated by spaces but need to be separated. Examples are A—.0.049 and 0.02M-HCl (A—.
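The two-file layout described for the first lexicon can be illustrated with a small sketch. The entry structure and field names below are assumptions for illustration; the point shown is the one stated in the text: selection restriction frames live in a separate, numbered table so that one frame can be shared by several verbs.

```python
# Selection restriction frames are stored separately and indexed
# by number, so a frame can apply to more than one verb.
FRAMES = {
    1: {"agent": "human", "patient": "chemical"},  # shared analysis frame
}

# Lexicon entries reference a frame by its index number.
LEXICON = {
    "determine": {"cat": "verb", "sem": "analysis", "frame": 1},
    "analyse":   {"cat": "verb", "sem": "analysis", "frame": 1},  # shares frame 1
    "copper":    {"cat": "noun", "sem": "chemical"},
}

def frame_of(word):
    """Return the selection restriction frame for a word, if any."""
    entry = LEXICON.get(word, {})
    return FRAMES.get(entry.get("frame"))

# Both verbs resolve to the very same frame object:
assert frame_of("determine") is frame_of("analyse")
```

Storing one copy of each frame and pointing to it by index is exactly the space-saving motivation given in the text.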
stands for d=; in general, Analytical Abstracts encodes all non-ASCII characters as strings between dots). These are tackled by a separate procedure as well. It checks whether the components occur in the lexicon, first taking into account the possible existence of abbreviations, punctuation marks, chemical formulas, and numbers. If a full stop occurs at the end and cannot be recognized as being part of an abbreviation, it is recognized as a sentence end. The morphological analyzer is implemented as a SPITBOL26 program. Each string between spaces is checked for occurrence in the first lexicon, and if there is no entry, the various morphological rules and the above-mentioned procedures are applied, after which the first lexicon is consulted again.

The postprocessor is called after the morphological analyzer, using its output as input. It deals with a number of (domain dependent) concepts that consist of more than one word. Compound words and idiomatic expressions (that syntactically can be viewed as one word, like "with respect to") are recognized by consulting the second lexicon, after which the component words and their categories are replaced by the compound term (or idiomatic expression). The postprocessor deals with complex chemical compound names by looking in the first lexicon for their parts. All possible parts are labeled, and if the labels agree with each other, the parts are replaced by the compound name with one set of syntactic and semantic categories. This way, complex chemical compound names need not be stored in the lexicon. The background of this procedure is that the set of parts is limited, contrary to the size of the set of chemical compound names. Lexically ambiguous words get all the word classes that are possible. A choice of the proper one is made during the parsing process.
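The part-based recognition of compound names can be sketched as follows. The part inventory, the labels, and the greedy longest-match splitting strategy below are all assumptions made for illustration; the text only specifies that parts are looked up, labeled, and accepted when their labels agree.

```python
# Hypothetical sketch of part-based compound-name recognition:
# the lexicon stores only name *parts*, each with a label; a full
# name is accepted when every part carries the same label.
PARTS = {"di": "organic", "chlor": "organic", "o": "organic",
         "meth": "organic", "ane": "organic"}

def split_parts(name):
    """Greedy longest-match split of a name into known parts."""
    parts, i = [], 0
    while i < len(name):
        for j in range(len(name), i, -1):
            if name[i:j] in PARTS:
                parts.append(name[i:j])
                i = j
                break
        else:
            return None                  # unknown substring: give up
    return parts

def classify(name):
    """Return the common label if all part labels agree, else None."""
    parts = split_parts(name)
    if parts and len({PARTS[p] for p in parts}) == 1:
        return PARTS[parts[0]]
    return None

# "dichloromethane" splits into di|chlor|o|meth|ane, all "organic".
```

The motivation stated in the text is visible here: the `PARTS` table stays small while the set of names it can accept (all agreeing combinations of parts) is far larger.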
If a word cannot be processed and classified during the lexical phase, the user is automatically asked for all the lexical information. A dedicated user-friendly lexicon editor is being developed.

The parser is based on Chomsky's principles of Government and Binding19,20 for the syntactic part and on Montague semantics for the semantic part. Chomsky's theory of Government and Binding is syntax oriented. It is based on general linguistic principles, and this basis should lead to a more robust parser; its choice is also motivated by research interest in the application possibilities of the theory. One of its features is that it works with general language-wide templates instead of far more language-specific phrase-structure rules. Its appealing features are, for instance, described by McHale and Myaeng.14 The theory does not postulate a strict formalism; it is implemented as a transformational grammar in GRAMTSY (a so-called "transformational driver"; for more information, see ref 21). The choice of Montague semantics is motivated by its solid logical foundation. The output of the parser is not an intensional logic representation, however, but a predicate logic representation. A discussion of possible criticisms and a motivation of the choices made are given in more detail by van Bakel.22

The parser uses "underspecification" as a principle in order to eliminate the combinatorial explosion resulting from ambiguities that cannot be resolved in the different modules.23 A choice between multiple possible solutions is postponed, using some general notation, until the module that is capable of resolving it. This prevents the generation and testing of numerous solutions in order to locate the correct one. An example for which it is used is "The determination of clemastine fumarate in ... with ...
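The gain from underspecification in prepositional-phrase attachment can be made concrete with a small sketch. The representation below (a list of candidate heads per PP) is an assumed notation, not the system's; it only illustrates the counting argument: enumerating attachments multiplies, while one underspecified structure stays linear.

```python
from itertools import product

def enumerated(pp_count):
    """Naive approach: one full parse per attachment combination.
    Each PP may attach to the verb (0) or to any preceding PP."""
    choices = [range(i + 1) for i in range(pp_count)]
    return list(product(*choices))

def underspecified(pp_count):
    """Underspecified approach: a single structure in which each PP
    merely records its candidate heads; a later module (here, the
    semantics with its selection restriction frames) picks one."""
    return [{"pp": i, "candidates": list(range(i + 1))}
            for i in range(pp_count)]

# Three PPs: 1 * 2 * 3 = 6 full parses versus one compact record.
```

For a sentence like the determination example with three trailing PPs, the parser would hand the semantics one underspecified structure instead of six trees, which is the generate-and-test work the text says is avoided.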
by ...". A number of prepositional phrases follow a verb or noun phrase, and the syntax cannot determine whether the second and third prepositional phrases are connected to the verb or to one of the previous prepositional phrases, resulting in a reasonable number of possible combinations. In this example the semantics module will determine the correct connections using the selection restriction frames of the various words (see later in this section).

The parser consists of a syntactic module, a semantic module, and a postprocessor. The syntactic module consists of a submodule for a context-free analysis producing a surface structure and a second submodule which executes a transformational analysis. The first submodule is based on a context-free rewrite grammar according to the Extended Affix Grammar (EAG) formalism.24 The grammar is converted into a parser by the parser generator GRAMMA.24 The surface structure is a decomposition of the sentence into its syntactic categories (verb phrase, noun phrase, prepositional phrase, verb, etc.). It can be represented as a tree; see Figure 2 for an example (first decomposition tree). In that figure, S stands for sentence, NP for noun phrase, AUX for auxiliaries, VP for verb phrase, V for verb, etc. A number of intermediate nodes are added for grouping various nodes (syntactic categories) on various levels, some of which do not occur in the current sentence. The strings between the square brackets identify the various syntactic and semantic features of the nodes on the given positions or originate from the lexicon entries of the words.
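A surface structure of this kind can be represented with a simple node type. The tree below is an illustrative stand-in for Figure 2 (the sentence, the node names beyond those the text lists, and the feature strings are assumptions): each node carries a category, an optional feature list corresponding to the bracketed strings, and its children.

```python
# Minimal surface-structure tree: category, bracketed features,
# children, and (for leaves) the word from the sentence.
class Node:
    def __init__(self, cat, features=None, children=None, word=None):
        self.cat, self.word = cat, word
        self.features = features or []   # e.g. ["passive"]
        self.children = children or []

    def leaves(self):
        """Recover the sentence by reading the leaf words in order."""
        if self.word is not None:
            return [self.word]
        return [w for c in self.children for w in c.leaves()]

# Illustrative decomposition of a short passive sentence:
tree = Node("S", children=[
    Node("NP", children=[Node("N", word="copper")]),
    Node("VP", features=["passive"], children=[
        Node("AUX", word="is"),
        Node("V", features=["passive"], word="determined"),
    ]),
])
```

Intermediate grouping nodes that happen to be empty for a given sentence would simply be `Node`s with no children, matching the remark that some added nodes do not occur in the current sentence.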
Journal: Journal of Chemical Information and Computer Sciences
Volume 36, Issue 4
Pages: -
Published: 1996